(54) Maximum likelihood linear regression (MLLR) speaker adaptation using dynamic weighting



(57) According to the prior art, Maximum Likelihood
Linear Regression strategies have in common that the
influence or weight of a new utterance remains the
same throughout the whole adaptation process.
According to the present invention, after a first very
quick adaptation to a new speaker, new utterances are
weighted less than all previous speaker-specific
utterances, so that the sum of older utterances from a
specific speaker has much more influence than a few new
ones if this speaker uses the system for a long time.
Description

[0001] The present invention concerns a method to
perform an unsupervised speaker adaptation for
speech recognition systems that use continuous
density Hidden Markov Models (HMMs) with Maximum
Likelihood Linear Regression (MLLR) adaptation.
[0002] State-of-the-art speech recognizers consist
of a set of statistical distributions modeling the acoustic
properties (encoded in feature vectors) of certain
speech segments. As a simple example, one Gaussian
distribution is taken for each phoneme. These
distributions are attached to states. A stochastic model,
usually a continuous density Hidden Markov Model, defines
the probabilities for sequences of states and for acoustic
properties given a state. Passing a state consumes one
acoustic feature vector covering a frame of e.g. 10 ms
of the speech signal. The stochastic parameters of such
a recognizer are trained using a large amount of speech
data, either from a single speaker, yielding a speaker
dependent (SD) system, or from many speakers, yielding
a speaker independent (SI) system.
[0003] Nowadays most speech recognition systems
that use Hidden Markov Models to represent the
different phonemes of a language are speaker
independent. However, state-of-the-art speaker
dependent systems normally yield much higher recognition
rates than speaker independent systems. Therefore, speaker
adaptation is a widely used method to increase the
recognition rates of speaker independent systems.
However, for many applications it is not feasible to gather
enough data from a speaker to train the system. In the case
of a consumer device this might not even be wanted,
since the device has to serve different users. To
overcome this mismatch in recognition rates, speaker
adaptation algorithms are widely used in order to achieve
recognition rates that come close to those of speaker
dependent systems while using only a fraction of the
speaker dependent data that speaker dependent systems
need. These systems initially take speaker independent
models and adapt them so that they better match the new
speaker's acoustics by use of the speech received from
said speaker (adaptation data).
[0004] The basic principle of many speaker adaptation
techniques is to modify the parameters of the Hidden
Markov Models, e.g. those of the Gaussian
densities modeling the acoustic features. In Maximum
Likelihood Linear Regression adaptation a transformation
matrix is calculated from the adaptation data, and
groups of model parameters, e.g. the mean vectors or
the variance vectors, are multiplied with this
transformation matrix (or n transformation matrices) to
maximize the likelihood of the adaptation data.
[0005] Usually only the parameters of those Gaussian
densities whose corresponding phonemes have been
observed in the adaptation data can be updated.
In MLLR adaptation all Gaussian densities are clustered
to build so-called regression classes. For each
regression class a separate transformation matrix is
calculated. Each time one or several phonemes from a
specific regression class are observed in the adaptation
data, a transformation matrix is calculated for this
class and all Gaussian densities belonging to it are
adapted. Thus, even those Gaussian densities whose
phonemes have not been observed in the adaptation
data can be updated, which makes this approach
faster than comparable ones. Thereafter the next
spoken utterance is analyzed with the updated model
parameters and the adaptation can be performed in a
next adaptation step.
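
The class-wise update described in paragraph [0005] can be sketched as follows. This is a minimal illustration rather than the method of any particular recognizer: it assumes that each transformation is an affine map stored as a d x (d+1) matrix and applied to the extended mean vector [1, mu], a common MLLR convention that the text does not spell out. The names `apply_class_transforms`, `means`, `class_of` and `transforms` are hypothetical.

```python
import numpy as np

def apply_class_transforms(means, class_of, transforms):
    """Return adapted means; every Gaussian of a transformed class is moved.

    means      : (J, d) array, one mean vector per Gaussian density
    class_of   : length-J integer array, regression class of each Gaussian
    transforms : dict {class index: (d, d+1) matrix W}; classes that were
                 not observed in the adaptation data are simply absent
    """
    adapted = means.copy()
    for r, W in transforms.items():
        members = np.flatnonzero(class_of == r)
        # Extended mean vectors [1, mu] (assumed convention for the offset).
        extended = np.hstack([np.ones((members.size, 1)), means[members]])
        adapted[members] = extended @ W.T
    return adapted

means = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
class_of = np.array([0, 0, 1])
# Class 0 receives a pure shift by (+1, -1); class 1 was not observed.
W0 = np.hstack([np.array([[1.0], [-1.0]]), np.eye(2)])
adapted = apply_class_transforms(means, class_of, {0: W0})
print(adapted)
```

Both Gaussians of class 0 are shifted by the one shared matrix, which is how a density whose phoneme was never spoken can still be adapted, while the Gaussian of the unobserved class 1 keeps its mean.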

[0006] As stated above, MLLR estimates linear
transformations for groups of model parameters to
maximize the likelihood of the adaptation data. Up to now,
MLLR has been applied to the mean parameters and
the Gaussian variances in mixture-Gaussian HMM
systems.

[0007] The above-described method according to
the state of the art obtains good results with rather large
amounts of adaptation data. If only very small amounts
of adaptation data are available for each adaptation
step, i.e. often only one utterance, which might e.g. be
a single word, the calculation of the transformation
matrices may be partly erroneous, because the adaptation
statistics are estimated on non-representative data.
Therefore, it is the object underlying the present
invention to offer an improved method to perform an
unsupervised speaker adaptation for continuous density
Hidden Markov Models using the Maximum Likelihood Linear
Regression adaptation.

[0008] This object is solved according to independent
claim 1; preferred embodiments are defined in
dependent subclaims 2 to 9.
[0009] According to the inventive method a very fast
adaptation can be achieved, since the transformation
matrix for each regression class can be calculated
reliably after a single utterance (or a few utterances),
corresponding to only a few seconds of speech. An
on-line adaptation is therefore also possible. After the
calculation of the respective transformation matrix the
group of parameters belonging to that regression class is
updated, and the next few seconds are then recognized
using the HMMs that were modified in the previous step,
and so on. In this way, a very fast adaptation to a new
speaker can be performed.

[0010] The present invention will be better understood
from the following detailed description of an exemplary
embodiment thereof, taken in conjunction with
Fig. 1, which shows a recognition and adaptation
procedure including the dynamic weighting scheme, and
formulas (1) to (5).

[0011] The exemplary embodiment of the present
invention uses the mean parameters in mixture-Gaussian
HMM systems as the group of model parameters that
maximize the likelihood of the adaptation data. As
mentioned above, the present invention is not limited
thereto; the Gaussian variances or another
group of model parameters can also be used.

Fig. 1 shows the recognition and adaptation
procedure including the dynamic weighting scheme
according to the present invention.

[0012] In an initialization step all the mean vectors
$\mu_j$ of the Gaussian densities are assigned to one of the
regression classes r so that the means $\mu_{j,r}$ are available
as the group of model parameters to maximize the
likelihood of the adaptation data. The regression classes
could be designed using a standard Vector Quantization
algorithm, i.e. by clustering feature vectors according to
any numerical distance measure, but the use of
regression class trees or any other method is also possible.
The design and assignment of the regression classes r
prior to recognition can also be performed dynamically
depending on the amount of available adaptation data
during the recognition process.
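
One way to realize the initialization step is sketched below. It is only an illustration under stated assumptions: plain k-means with squared Euclidean distance stands in for the "standard Vector Quantization algorithm", and it clusters the mean vectors themselves. The function name `assign_regression_classes` and its parameters are hypothetical.

```python
import numpy as np

def assign_regression_classes(means, num_classes, iterations=20, seed=0):
    """Cluster mean vectors with plain k-means and return a class per mean."""
    rng = np.random.default_rng(seed)
    # Start from distinct mean vectors chosen at random (fancy indexing copies).
    start = rng.choice(len(means), size=num_classes, replace=False)
    centroids = means[start]
    for _ in range(iterations):
        # Squared distance of every mean to every centroid: shape (J, R).
        diffs = means[:, None, :] - centroids[None, :, :]
        labels = (diffs ** 2).sum(axis=2).argmin(axis=1)
        for r in range(num_classes):
            if np.any(labels == r):  # keep the old centroid if a class is empty
                centroids[r] = means[labels == r].mean(axis=0)
    return labels

means = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.0, 10.1]])
labels = assign_regression_classes(means, num_classes=2)
print(labels)
```

The returned `labels` array plays the role of the assignment of each mean to a regression class r. A regression class tree, which the text also allows, would replace this flat clustering with a hierarchy.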

[0013] After the incoming speech has been recognized
in a step S1, it is aligned to the model states in a
step S2. Thereafter, the relevant statistics are extracted
and used to calculate one transformation matrix $W_r$ for
each of the involved regression classes in step S3. New
mean vectors $\hat{\mu}_{j,r}^{k}$ are computed using the
corresponding transformation matrix by applying the following
equation (1):

$$\hat{\mu}_{j,r}^{k} = W_r\,\mu_{j,r}^{k-1} \qquad (1)$$

in a step S4 to calculate the estimated means $\hat{\mu}_{j,r}^{k}$,
where k is the current adaptation step and (k-1) the
previous one. This updating operation of the
means, in particular the computation of $W_r$, is done
according to known approaches of the prior art.
[0014] According to the prior art, these estimated
means $\hat{\mu}_{j,r}^{k}$ are equal to the updated means
$\mu_{j,r}^{k}$ that are used in the modified HMM models, as
shown in equation (1a), and thereby all $\mu_{j,r}^{k}$ from this
regression class are estimated from the adaptation data
observed for this regression class in the current
adaptation step k. Thereafter, according to
the prior art, steps S6 and S7 are directly performed,
wherein the HMM models that are used without adaptation
in a first adaptation step are adapted according to
the following equation (1a):

$$\mu_{j,r}^{k} = \hat{\mu}_{j,r}^{k} \qquad (1a)$$

[0015] As mentioned before, the procedure
described so far works according to the procedure
already known from the prior art. However, as also
mentioned above, this method works reliably only with a
rather large amount of adaptation data.
[0016] According to the present invention, on the
other hand, the adapted mean according to equation (1)
as calculated in step S4 is not directly used to modify
the HMMs; instead, the adapted mean parameters are
modified in a step S5, in which a weighted sum of the
"old" and "new" mean is used to modify the HMM
models.

[0017] In step S5, therefore, the updated mean $\mu_{j,r}^{k}$
is not calculated as in equation (1a) above, where
it directly corresponds to $\hat{\mu}_{j,r}^{k}$, but is basically
calculated as follows in equation (2):

$$\mu_{j,r}^{k} = \alpha_r\,\hat{\mu}_{j,r}^{k} + (1-\alpha_r)\,\mu_{j,r}^{k-1} \qquad (2)$$

where $\alpha_r$ is a first weighting factor for the respective
regression class r. The index k starts with 1; $\mu_{j,r}^{0}$ are the
mean vectors of the speaker-independent system.
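
The weighted sum of equation (2) amounts to a single interpolation per mean vector. The following sketch assumes NumPy arrays for the means; the function name `weighted_mean_update` and the example numbers are illustrative only.

```python
import numpy as np

def weighted_mean_update(mu_prev, mu_est, alpha):
    """Equation (2): blend the MLLR estimate with the previous mean."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * mu_est + (1.0 - alpha) * mu_prev

mu_si = np.array([1.0, 2.0])   # mu^0: mean of the speaker-independent system
mu_hat = np.array([3.0, 0.0])  # estimate obtained from equation (1)
mu_1 = weighted_mean_update(mu_si, mu_hat, alpha=0.25)
print(mu_1)  # [1.5 1.5]
```

With alpha = 1 the update reduces to the prior-art rule of equation (1a); with alpha = 0 the model is left unchanged.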
[0018] With a first fixed weighting factor $\alpha_r$ = 0.001
... 0.9, the new utterances represented by $\hat{\mu}_{j,r}^{k}$ are
weighted so that the data is adapted using a short-term
history having an $\alpha$-dependent length. Therefore, the
transformation matrices obtained on the basis of small
amounts of adaptation data, which may be partly
erroneous, have a lower influence on the means used to modify
the HMMs, but a fast adaptation is secured. Using this
weighted sum allows a fast adaptation with only small
amounts of adaptation data (e.g. 1 utterance), so that
an on-line adaptation is possible.
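
The "short-term history" can be made explicit by unrolling the recursion of equation (2) for a constant weighting factor. The following is a worked derivation added for illustration, with the indices j and r dropped for readability:

$$\mu^{k} = \alpha \sum_{i=1}^{k} (1-\alpha)^{k-i}\,\hat{\mu}^{i} + (1-\alpha)^{k}\,\mu^{0}$$

The estimate obtained m steps ago therefore enters with weight $\alpha(1-\alpha)^{m}$, which decays geometrically. A large $\alpha$ gives a short memory and fast but noisy adaptation; a small $\alpha$ gives a long memory and slow, smooth adaptation.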
[0019] In a preferred embodiment, the first weighting
factor $\alpha_r$ is changed dynamically while the new
speaker is using the system. A better performance can
be gained when major changes to the HMMs are made
when a new speaker starts using the system, so that the
HMMs better match his acoustics and a low failure rate
is reached very quickly. Later, the changes should become
smaller and smaller, so that new utterances of a speaker
are weighted less than all previous utterances of this
specific speaker and the sum of older utterances
from this specific speaker has much more influence
than the newly spoken ones. Therefore, the number of
frames that have been observed so far is taken into
account to dynamically change the first weighting factor
$\alpha_r$ as follows in equation (3):

$$\alpha_r^{k} = \frac{n_r^{k}}{\tau_r^{k-1} + n_r^{k}} \qquad (3)$$

[0020] Therefore, $(1-\alpha_r^{k})$ of equation (2) above is
defined as follows in equation (4):

$$(1-\alpha_r^{k}) = \frac{\tau_r^{k-1}}{\tau_r^{k-1} + n_r^{k}} \qquad (4)$$

where $n_r^{k}$ is the number of frames that were observed
so far during adaptation step k in regression
class r and $\tau_r^{k-1}$ is a second weighting factor determining
the initial influence of the speaker independent models,
which is determined heuristically. $\tau_r^{k}$ can also be a
constant.
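
For a single adaptation step, equations (3) and (4) can be evaluated as in the sketch below. The function name `dynamic_alpha` and the example values (200 frames, a second weighting factor of 1000) are illustrative assumptions, not values prescribed by the text.

```python
def dynamic_alpha(n_frames, tau_prev):
    """Equation (3): weight of the new estimate for one regression class."""
    if n_frames < 0 or tau_prev <= 0:
        raise ValueError("need n_frames >= 0 and tau_prev > 0")
    return n_frames / (tau_prev + n_frames)

alpha = dynamic_alpha(n_frames=200, tau_prev=1000.0)
one_minus_alpha = 1000.0 / (1000.0 + 200)  # equation (4)
print(alpha, one_minus_alpha)
```

The two weights sum to one, so equation (4) follows directly from equation (3). A larger second weighting factor keeps the speaker-independent means in place for longer.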

[0021] According to a further preferred embodiment
it is possible to allow any additional utterances after a
first adaptation procedure to perform only a fine adaptation.
Therefore, the adaptation according to equations (3)
and (4) is weighted with the number of speech
frames that have already been observed from a specific
speaker, so that the adaptation that was already done in
the past is used with a bigger weight than the new
utterances. To achieve this, $\tau_r^{k}$ increases by $n_r^{k}$ after each
adaptation step, so that

$$\tau_r^{k} = \tau_r^{k-1} + n_r^{k}$$

$\tau_r^{0}$ is an initial value determined heuristically and may
vary between several hundreds and several thousands
(e.g. 100-10,000).
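
The effect of letting the second weighting factor grow can be seen by tracing the first weighting factor over a few adaptation steps. This sketch assumes utterances of 100 frames each and an initial value of 100, the lower end of the range given above; the function name `alpha_trajectory` is hypothetical.

```python
def alpha_trajectory(frames_per_step, tau_init):
    """Return the weighting factors alpha^k for a sequence of adaptation steps."""
    tau = float(tau_init)
    alphas = []
    for n in frames_per_step:
        alphas.append(n / (tau + n))  # equation (3)
        tau += n                      # tau^k = tau^(k-1) + n^k
    return alphas, tau

alphas, tau_final = alpha_trajectory([100, 100, 100, 100], tau_init=100)
print(alphas)     # 0.5, then 1/3, then 0.25, then 0.2
print(tau_final)  # 500.0
```

The first utterance moves the means halfway towards the new estimate, while each later utterance of the same length has less influence.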

[0022] Therewith, taking all preferred embodiments
into account, equation (2) above can be rewritten as:

$$\mu_{j,r}^{k} = \frac{\tau_r^{k-1}\,\mu_{j,r}^{k-1} + n_r^{k}\,\hat{\mu}_{j,r}^{k}}{\tau_r^{k-1} + n_r^{k}} \qquad (5)$$



[0023] Using this weighting scheme, $\alpha_r^{k}$, and thus
the influence of the most recently observed means,
decreases over time, and the parameters move
towards the optimum for that speaker after major
changes to the HMM models have been made when a
new speaker starts using the system.
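
The complete dynamic scheme of equation (5), with the second weighting factor increased by the observed frame count after each step, can be sketched for one Gaussian mean as follows. The estimates are supplied as fixed inputs purely for illustration; in the procedure of Fig. 1 each estimate would come from equation (1) and so depend on the previously adapted model. The name `adapt_means` and all numbers are hypothetical.

```python
import numpy as np

def adapt_means(mu_si, estimates, frames, tau_init):
    """Apply equation (5) step by step for one Gaussian mean."""
    mu = np.asarray(mu_si, dtype=float)
    tau = float(tau_init)
    for mu_hat, n in zip(estimates, frames):
        mu = (tau * mu + n * np.asarray(mu_hat, dtype=float)) / (tau + n)
        tau += n  # the second weighting factor grows by the frame count
    return mu

mu_si = [0.0, 0.0]
estimates = [[4.0, 2.0], [2.0, 2.0], [3.0, 1.0]]
frames = [100, 300, 200]
mu = adapt_means(mu_si, estimates, frames, tau_init=400)

# Frame-weighted average with the SI mean counted as tau_init frames.
weighted = sum(n * np.array(e) for e, n in zip(estimates, frames))
pooled = (400 * np.array(mu_si) + weighted) / (400 + sum(frames))
print(mu, pooled)  # both [1.6 1. ]
```

Under these assumptions the recursion gives the same result as a frame-weighted average in which the speaker-independent mean counts as 400 frames. This is why long-term use lets the accumulated history dominate any single new utterance.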
Method according to claim 2 or 3, characterized in
that said first weighting factor ($\alpha$) changes
dynamically depending on the time a new specific speaker
is using the system.

Method according to claim 3, characterized in that
said first weighting factor ($\alpha$) decreases with the
time said specific speaker uses the system.

Method according to claim 4, characterized in that
the first weighting factor ($\alpha$) is calculated as follows:

$$\alpha = \frac{n^{k}}{\tau^{k-1} + n^{k}}$$

wherein $n^{k}$ is the number of frames that were
observed so far during adaptation step k and $\tau^{k-1}$ is
a second weighting factor ($\tau$) that is determined
heuristically.

Method according to claim 5, characterized in that
said second weighting factor ($\tau$) is dependent on
the number of speech frames that have already
been observed from said specific speaker so that
the adaptation already done has a bigger weight
than newly spoken utterances.

Method according to claim 5 or 6, characterized in
that said second weighting factor ($\tau$) increases by
$n^{k}$ after each adaptation step.



Claims



Method to perform an unsupervised speaker adaptation
for Continuous Density Hidden Markov Models
using the Maximum Likelihood Linear
Regression adaptation, characterized in that an
adapted group of parameters ($a^{k}$) to maximize the
likelihood of the adaptation data is a weighted sum
of the previously adapted group of parameters
($a^{k-1}$) and the estimated group of parameters ($\hat{a}^{k}$)
according to the Maximum Likelihood Linear
Regression.

Method according to claim 1, characterized in that
the weighted sum is calculated as follows:

$$a^{k} = \alpha \cdot \hat{a}^{k} + (1-\alpha) \cdot a^{k-1},$$

wherein k is the actual adaptation step, $a^{k}$ is the
adapted group of parameters that maximizes the
likelihood of the adaptation data, $a^{k-1}$ is the adapted
group of parameters that previously maximized the
likelihood of the adaptation data, $\hat{a}^{k}$ is the
estimated group of parameters that is calculated with
the Maximum Likelihood Linear Regression and $\alpha$ is
a first weighting factor.



Method according to any one of the preceding
claims, characterized in that said group of
parameters ($a^{k}$) to maximize the likelihood of the
adaptation data are the mean parameters ($\mu_{j,r}^{k}$, j: number
of the respective Gaussian class, r: number of the
respective regression class within the respective
Gaussian class) in mixture-Gaussian HMM
systems.

Method according to any one of the preceding
claims, characterized in that said group of
parameters ($a^{k}$) to maximize the likelihood of the
adaptation data are the Gaussian variance parameters ($\Sigma$)
in mixture-Gaussian HMM systems.